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METHOD AND APPARATUS FOR BATCHED NETWORK SECURITY 
PROTECTION SERVER PERFORMANCE 

RELATED APPLICATIONS 

This application claims the benefit of United States Provisional 
Application Number 60/21 1 ,023 filed 06/12/2000, and Application Number 
60/21 1,03 1, filed 6/12/2000, both of which are incorporated herein by reference. 

FIELD OF THE INVENTION 

The claimed invention relates to the field of secure communications. 
More particularly it relates to improving the efficiency of establishing secure 
network communications. 

BACKGROUND OF THE INVENTION 

Many network transactions require secure communications. The Secure 
Socket Layer ("SSL") is the most widely deployed protocol for securing 
communication on the World Wide Web ("WWW"). SSL along with other 
protocols such as Transport Layer Security ("TLS") are used by E-commerce 
and financial web sites to guarantee privacy and authenticity of information 
exchanged between a web server and a web browser. Currently, the number of 
web sites using SSL and TLS to secure web traffic is growing at a phenomenal 
rate and as the services provided on the World Wide Web continue to expand so 
will the need to establish secure connections. 
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Unfortunately, SSL and TLS are not cheap, A number of studies show 
that web servers using these protocols perform far worse than web servers that 
do not encrypt web traffic. In particular, a web server using SSL can handle 30 
to 50 times fewer transactions per second than a web server using clear-text 
5 communication only. The exact transaction performance degradation depends 

on the type of web server used by the site and the security protocol 
implemented. To overcome this degradation web sites typically buy 
significantly more hardware in order to provide a reasonable response time to 
their customers. 

10 Web sites often use one of two techniques to overcome secure 

communication's impact on performance. The first method, as indicated above, 
is to deploy more machines at the web site and load balance connections across 
these machines. This is problematic since more machines are harder to 
administer. In addition, mean time between failures decreases significantly. 

15 The other solution is to install a hardware acceleration card inside the web 

server. The card handles most of the secure network workload thus enabling the 
web server to focus on its regular tasks. Accelerator cards are available from a 
number of vendors and while these cards reduce the penalty of using secure 
connections, they are relatively expensive and are non-trivial to configure. Thus 

20 there is a need to establish secure communications on a network at a lower cost. 
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SUMMARY OF THE INVENTION 

A method and apparatus for batching secure communications in a 
computer network are provided. When a web browser first connects to a web 
5 server using secure protocols, the browser and server execute an initial 

handshake protocol. The outcome of this protocol is a session encryption key 
and a session integrity key. These keys are only known to the web server and 
web browser, and establish a secure session. 

Once session keys are established, the browser and server begin 

10 exchanging data. The data is encrypted using the session encryption key and 

protected from tampering using the session integrity key. When the browser 
and server are done exchanging data the connection between them is closed. 

The establishment of a secure session using a protocol such as Secure 
Socket Layer ("SSL") begins when the web browser connects to the web server 

1 5 and sends a client-hello message. Soon after receiving the message, the web 

server responds with a server-hello message. This message contains the server's 
public key certificate that informs the client of the server's Rivest-Shamir- 
Adleman algorithm ("RS A") public key. Having received the public key, the 
browser picks a random 48-byte string, R, and encrypts it using the key. Letting 

20 C be the resulting cipher-text of the string R, the web browser then sends a 

client-key-exchange message containing C. The 48-byte string R is called the 
pre- master- secret. Upon receiving the message, from the browser, the web 
server uses its RSA private key to decrypt C and thus learns R. Both the 
browser and server then use R and some other common information to derive 
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the session keys. With the session keys established, encrypted message can be 

sent between the browser and server with impunity. 

The decryption of the encrypted string, R, is the expensive part of the 

initial handshake. An RS A public key is made of two integers (N 9 e). In an 
5 embodiment N = pq is the product of two large primes and is typically 1024 bits 

long. The value e is called the encryption exponent and is typically some small 

number such as e = 65537. Both N and e are embedded in the server's public 

key certificate. The RSA private key is simply an integer d satisfying e * d — 1 

mod (p - 1) (q ~ 1). Given an RSA cipher-text Q the web server decrypts C by 
10 using its private key to compute Cf mod N that reveals the plain-text message, 

R, Since both d and N are large numbers (each 1024 bits long) this computation 

takes some effort. 

At a later time, the browser may reconnect to the same web server. 

When this happens the browser and server execute the SSL resume handshake 
15 protocol. This protocol causes both server and browser to reuse the session 

keys established during the initial handshake saving invaluable resources. All 

application data is then encrypted and protected using the previously established 

session keys. 

Of the three phases, the initial handshake is often the reason why SSL 
20 degrades web server performance. During this initial handshake the server 

performs an RSA decryption or an RSA signature generation. Both operations 
are relatively expensive and the high cost of the initial handshake is the main 
reason for supporting the resume handshake protocol. The resume handshake 
protocol tries to alleviate the cost of the initial handshake by reusing previously 
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negotiated keys across multiple connections. However, in the web 
environment, where new users constantly connect to the web server, the 
expensive initial handshake must be executed over and over again at a high 
frequency. Hence, the need for reducing the cost of the initial handshake 
5 protocol. 

One embodiment presents an implementation of batch RSA in an SSL 
web server while other embodiments present substantial improvements to the 
basic batch RSA decryption algorithms. These embodiments show how to 
reduce the number of inversions in the batch tree to a single inversion. Another 

1 0 embodiment further speeds up the process by proper use of the Chinese 

Remainder Theorem ("CRT") and simultaneous multiple exponentiation. 

A different embodiment entails an architecture for building a batching 
SSL web server. The architecture in this embodiment is based on using a 
batching server process that functions as a fast decryption oracle for the main 

1 5 web server processes. The batching server process includes a scheduling 

algorithm to determine which subset of pending requests to batch. 

Yet other embodiments improve the performance of establishing secure 
connections by reducing the handshake work on the server per connection. One 
technique supports web browsers that deal with a large encryption exponent in 

20 the server's certificate, while another approach supports any browser. 
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BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example in the following 
figures in which like references indicate similar elements. The following 
5 figures disclose various embodiments of the claimed invention for purposes of 

illustration only and are not intended to limit the scope of the claimed invention. 

Figure 1 is a flow diagram of the initial handshake between a web 
server and a client of an embodiment. 

Figure 2 is a block diagram of an embodiment of a network system for 
10 improving secure communications. 

Figure 3 is a flow diagram for managing multiple certificates using a 
batching architecture of an embodiment. 

Figure 4 is a flow diagram of batching encrypted messages prior to 
decryption in an embodiment. 
15 Figure 5 is a flow diagram for increasing efficiency of the initial 

handshake process by utilizing cheap keys in an embodiment. 

Figure 6 is a flow diagram for increasing efficiency of the initial 
encryption handshake by utilizing square keys in an embodiment. 
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The establishment of a secure connection between a server and a 
browser can be improved by batching the initial handshakes on the web server. 

5 In one embodiment the web server waits until it receives b handshake requests 

from b different clients. It treats these b handshakes as a batch, or set of 
handshakes, and performs the necessary computations for all b handshakes at 
once. Results show that, for b = A, batching the Secure Socket Layer ("SSL") 
handshakes in this way results in a factor of 2.5 speedup over doing the b 

10 handshakes sequentially, without requiring any additional hardware. While the 

Secure Socket Layer protocol is a widely utilized technique for establishing a 
secure network connection, it should be understood that the techniques 
described herein can be applied to the establishment of any secure network- 
based connection using any of a number protocols. 

1 5 One embodiment improves upon a technique developed by Fiat for batch 

RS A decryption. Fiat suggested that decrypting multiple RSA cipher-texts as a 
batch would be faster than decrypting them one by one. Unfortunately, 
experiments show that Fiat's basic algorithm, naively implemented, does not 
give much improvement for key sizes commonly used in SSL and other network 

20 security protection handshakes. 

A batching web server must manage multiple public key certificates. 
Consequently, a batching web server must employ a scheduling algorithm that 
assigns certificates to incoming connections, and picks batches from pending 
requests, so as to optimize server performance. 
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To encrypt a message Musing an RSA public key (N 9 e), the message M 
is formatted to obtain an integer X in {1, ... , AT}. This formatting is often done 
using the PKCS1 standard. The cipher-text is then computed as C = X 6 mod N. 
This process occurs during the initial stages of the initial handshake between a 
5 client and server when attempting to create a secure connection. 

To decrypt a cipher-text C the web server uses its private key d to 
compute the e !th root of C in Z N . The e th root of C is given by Cf mod N as 
previously noted. Since both d and TV are large numbers (each 1 024 bits long) 
this is a lengthy computation on the web server. It is noted that d must be taken 
10 as a large number (i.e., on the order of AO since otherwise the RSA system is 

insecure. 

The general process in establishing a Secure Socket Layer 
communication between a browser or client and a server or host is depicted in 
Figure 1 . The process begins with a request from the browser to establish a 

1 5 secure session 110. The client forms a hello message requesting a public key 

and transmits the message to the server 114. Upon receiving the client-hello 
message, the web server responds with a server-hello message containing a 
public key 118. The public key is one half of a public / private key pair. While 
the server transmits the public key back to the browser the server keeps the 

20 private key. Once the client receives the public key 1 22 a random number R is 

generated 126. This random number is the session key. The client encrypts R 
by using the private key that it received from the server 132. With the number 
R encrypted, the client sends the cipher-text to the web-server 138. Upon 
receiving the cipher-text 142 the web server user the private key portion of the 

C:\NrPortbl\PALIBl\MCM\1290290_19.DOC (65282) -8- 



public / private key pair to decrypt the cipher-text 146. With both the client and 

the server possessing the session key R, a new encrypted secure socket layer 

session 160 is established using R as the session key 158. This session is truly 

encrypted since only the client and the web server possess the session key for 

encryption and decryption. 

When using small public exponents, e\ and e 2 , which are components of 

the public key, it is possible to decrypt two cipher-texts for approximately the 

price of one. Suppose v x is a cipher-text obtained by encrypting using the 

public key (N, 3>. Similarly, imagine v 2 is a cipher-text obtained by encrypting 

using the public key (N, 5). To decrypt vi and v 2( computing v/ 73 and vi 1/5 mod 

TV are made by setting^ - (vi 5 * v 2 3 ) 1/15 it can be shown that 

j 10 j6 
Vl 1/3 =— andv* /5 = — 

1 3 2 2 2 

vf -V 2 Vj -v 2 

Hence, at the cost of computing a single 1 5 th root both vi and v 2 can be 

decrypted. 

This batching technique is most useful when the public exponents e\ and 
e 2 are very small (e.g., 3 and 5). Otherwise, the extra arithmetic required can be 
expensive. Also, only cipher-texts encrypted using distinct public exponents 
can be batch decrypted. Indeed, it can be shown that it is not possible to batch 
when the same public key is used. That is, it is not possible to batch the 
computation of vi 1/3 and v 2 1/3 . 

This observation to the decryption of a batch of b RSA cipher-texts can 
be generalized. In one embodiment there are b distinct and pairwise relatively 
prime public keys e u . . . all sharing a common modulus N=pq. 
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Furthermore, assume there are b encrypted messages, v h . . . , v br one encrypted 
with each key, that are desirable to decrypt simultaneously, to obtain the plain- 
texts m . = v } fei - 

The batch process is implemented around a complete binary tree with b 
leaves, possessing the additional property that every inner node has two 
children. In one embodiment the notation is biased towards expressing locally 
recursive algorithms: Values are percolated up and down the tree. With one 
exception, quantities subscripted by L or R refer to the corresponding value of 
the left or right child of the node, respectively. For example, m is the value of m 
at a node; m R is the value of m at that node f s right child and so forth. 

Certain values necessary to batching depend on the particular placement 
of keys in the tree and may be pre-computed and reused for multiple batches. 
Pre-computed values in the batch tree are denoted with capital letters, and 
values that are computed in a particular decryption are denoted with lower-case 
letters. 

The batching algorithm consists of three phases; an upward-percolation 
phase, an exponentiation phase, and a downward-percolation phase. In the 
upward-percolation phase, the individual encrypted messages v,- are combined to 
form, at the root of the batch tree, the value 

v=n?^< <, ' wheree= nf=i^- 

In preparation, assign to each leaf node a public exponent: E <— e t . Each inner 
node then has its E computed as the product of those of its children: E<^E L - 
E R . The root node f s E will be equal to e 9 the product of all the public exponents. 
Each encrypted message v t is placed (as v) in the leaf node labeled with its 
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corresponding e t . The v's are percolated up the tree using the following recursive 
step, applied at each inner node: 

At the completion of the upward-percolation phase, the root node 
5 contains v = Y\ f =1 vf e( * In the exponentiation phase, the e * root of this v is 

extracted. Here, the knowledge of factorization of AT is required. The 
exponentiation yields y 1/e = Y\ \-\ v \ /&l > w hich is stored as m in the root node. 

In the downward-percolation phase, the intent is to break up the product 
m into its constituent subproducts m L and /?z R; and, eventually, into the 
10 decrypted messages mi at the leaves. At each inner node anJ is chosen 

satisfying the following simultaneous congruencies: 

X= 0 (mod E L ) X= 1 (mod E R ). 

The value Xis constructed using the Chinese Remainder Theorem ("CRT"). 
Two further numbers, Xl andX R , are defined at each node as follows: 
15 X L = X/£ L X R = (X- 1)/£ R . 

Both divisions are done over the integers. (There is a slight infelicity in the 
naming here: X L and Xr are not the same as the Xs of the node's left and right 
children, as implied by the use of the L and R subscripts, but separate values.) 
The values of X, Xl, and Xr are such that, at each inner node, m x equals 

20 V ^ L • v R R • m R « This immediately suggests the recursive step used in 

downward-percolation: 

X I X X \ 

m r <— m /\v L L *v R R ] m L <— m I m R • 
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At the end of the downward-percolation process, each leafs m contains 
the decryption of the v placed there originally. Only one large (full-size) 
exponentiation is needed, instead of b of them. In addition, the process requires 
a total of 4 small exponentiations, 2 inversions, and 4 multiplications at each of 
5 the 6-1 inner nodes. 

Basic batch RS A is fast with very large moduli, but may not provide a 
significant speed improvement for common sized moduli. This is because 
batching is essentially a tradeoff. Batching produces more auxiliary operations 
in exchange for fewer full-strength exponentiations. 

10 Batching in an SSL-enabled web server focuses on key sizes generally 

employed on the web, e.g., n = 1024 bits. Furthermore, this embodiment also 
limits the batch size b to small numbers, on the order of b = 4, since collecting 
large batches can introduce unacceptable delay. For simplicity of analysis and 
implementation, the values of b are restricted to powers of 2. 

15 Previous schemes perform two divisions at each internal node, for a total 

of 2b - 2 required modular inversions. Modular inversions are asymptotically 
faster than large modular exponentiations. In practice, however, modular 
inversions are costly. Indeed, the first implementation (with b — 4 and a 1024- 
bit modulus) spends more time doing the inversions than doing the large 

20 exponentiation at the root. Two embodiments, when combined, require only a 

single modular inversion throughout the algorithm with the cost of an additional 
0(b) modular multiplication. This tradeoff gives a substantial running-time 
improvement. 
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The first embodiment is referred to herein as delayed division. An 
important realization about the downward-percolation phase is that the actual 
value of m for the internal nodes of the tree is consulted only for calculating wl 
and /wr. An alternative representation of m that supports the calculation of 
5 and m R? and that can be evaluated at the leaves to yield m would do just as well. 

This embodiment converts a modular division alb to a "promise," (a, b). 
This promise can operate as though it were a number, and, can "force" getting 
its value by actually computing b' x a. Operations on these promises work in a 
way similar to operations in projective coordinates as follows: 

a/b = {a,b) {a,b) c =(a c ,b c ) 

10 c>(a,b) = (ac, b) (a, b) • <c, d) = {ac, bd) • 

(a, b)l ' c = (a, be) (a, b) l(c, d)~ (ad, be) 

Multiplication and exponentiation takes twice as much work had these promises 
not been utilized, but division can be computed without resort to modular 
inversion. 

If, after the exponentiation at the root, the root m is expressed as a 
15 promise, m <— {m, 1), this embodiment can easily convert the downward- 

percolation step to employ promises: 

m R <r-m A /(v^ L -Vjf* ) m L <f-m/m R . 

No internal inversions are required. The promises can be evaluated at the leaves 
to yield the decrypted messages. 
20 Batching with promises uses b - 1 additional small exponentiations and 

b - 1 additional multiplications. This translates to one exponentiation and one 
multiplication at every inner node, saving 2(b -l)-6 = b- 2 inversions. To 
further reduce the number of inversions, another embodiment uses batched 
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divisions. When using delayed inversions one division is needed for every leaf 
of the batch tree. In the embodiment using batched divisions, these b divisions 
can be done at the cost of a single inversion with a few more multiplications. 

As an example of this embodiment, invert three values x 9 y, and z. 
Continue by forming the partial product yz 9 xz 9 mdxy and then form the total 
product xyz and invert it, yielding (xyz)' 1 . With these values, calculate all the 
inverses: 



z" 1 = (xy) • {xyz)' 1 

Thus the inverses of all three numbers are obtained at the cost of only a single 
modular inverse along with a number of multiplications. More generally, it can 
be shown that by letting xi, . . . , x n e Zn, all n inverses x" 1 , . . . ,x~* can be 
obtained at the cost of one inversion and 3n - 3 multiplications. 

It can be proven that a general batched-inversion algorithm proceeds in 
three phases. First, set ,4* <- x h and ,4; <- x ( * • for i > 1. By induction, it can 
be shown that 



x" 1 = (yz) - (xyz)' 



A 



y X = (xz) • {xyzy\ 



Next, invert 



A n = j~J xj <> and store the result in 



n 




shown that 
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Finally, set Q <r~ B\, and Q <~- • B t for i > 1 . Furthermore, C\ = B\ = 
x\~ l , and, by combining, Q =Ai.\ - Bi = xf 1 for z > L This embodiment has thus 
inverted each x;. 

Each phase above requires n - 1 multiplications, since one of the n 
5 values is available without recourse to multiplication in each phase. Therefore, 

the entire algorithm computes the inverses of all the inputs in In - 3 
multiplications and a single inversion. 

In another embodiment batched division can be combined with delayed 
division, wherein promises at the leaves of the batch tree are evaluated using 
10 batched division. Consequently, only a single modular inversion is required for 

the entire batching procedure. Note that the batch division algorithm can be 
easily modified to conserve memory and store only n intermediate values at any 
given time. 

The Chinese Remainder Theorem is typically used in calculating RSA 
15 decryptions. Rather than computing m <- v d (mod iV), the modulo p and q is 

evaluated: 

m p v* p (modi?) m q <- y ^ (mod q). 

Here d p = d mod p-l and d q = d mod q - 1 . Correspondingly the CRT can 
calculate m from m p and m q . This is approximately 4 times faster than 
20 evaluating m directly. 

This idea extends naturally to batch decryption. La one embodiment 
each encrypted message v x modulo p and q is reduced. Then, instead of using a 
single batch tree modulo N, two separate, parallel batch trees, modulo p and q, 
are used and then combined to the final answers from both using the CRT. 
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Batching in each tree takes between a quarter and an eighth as long as in the 
original, unified tree since the number-theoretical primitives employed, as 
commonly implemented, take quadratic or cubic time in the bit-length of the 
modulus. Furthermore, the b CRT steps required to calculate each mi mod N 
5 afterwards take negligible time compared to the accrued savings. 

Another embodiment referred to herein as Simultaneous Multiple 
Exponentiation provides a method for calculating a u • & v mod m without first 
evaluating a u -b v - It requires approximately as many multiplications as does a 
single exponentiation with the larger of u or v as an exponent. 

10 For example, in the percolate-upward step, y y£* . y^ , the entire 

right-hand side can be computed in a single multi-exponentiation. The 
percolate-downward step involves the calculation of the quantity 
v l L * v r R ' w ^ c h can be accelerated similarly. These small-exponentiations-and- 
product calculations are a larger part of the extra bookkeeping work required for 
15 batching. Using Simultaneous Multiple Exponentiation reduces the time 

required to perform them by close to 50% by combining the exponentiation 
process. 

Yet another embodiment involves Node Reordering. Normally there are 
two factors that determine performance for a particular batch of keys. First, 
20 smaller encryption exponents are better. The number of multiplications 

required for evaluating a small exponentiation is proportional to the number of 
bits in the exponent. Since upward and downward percolation both use O(b) 
small exponentiations, increasing the value of e ~ Y\ e t can have a drastic effect 
on the efficiency of batching. 
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Second, some exponents work well together. In particular, the number 
of multiplications required for a Simultaneous Multiple Exponentiation is 
proportional to the number of bits in the larger of the two exponents. If batch 
trees are built that have balanced exponents for multiple exponentiation {El and 
5 E R , then Xi and Xr, at each inner node), the multi-exponentiation phases can be 

streamlined. 

With b = 4, optimal reordering is fairly simple. Given public exponents 
e \ < e 2 < e s < e 4 ' the airangOTrart ^ _^ _^ _^ minimizes the disparity 
between the exponents used in Simultaneous Multiple Exponentiation in both 

10 upward and downward percolation. Rearranging is harder for b > 4. 

Figure 2 is an embodiment of a system 200 for improving secure 
communications. The system includes multiple client computers 232, 234, 236, 
238 and 240 which are coupled to a server system 210 through a network 230. 
The network 230 can be any network, such as a local area network, a wide area 

1 5 network, or the Internet. Coupled among the server system 210 and the network 

230 is a decryption server. While illustrated as a separate entity in Figure 2, the 
decryption server can be located independent of the server system or in the 
environment or among any number of server sites 212, 214 and 216. The client 
computers each include one or more processors and one or more storage 

20 devices. Each of the client computers also includes a display device, and one or 

more input devices. All of the storage devices store various data and software 
programs. In one embodiment, the method for improving secure 
communications is carried out on the system 200 by software instructions 
executing on one or more of the client computers 232 - 240. The software 
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instructions may be stored on the server system 210 any one of the server sites 
212 - 216 or on any one of the client computers 232 - 240. For example, one 
embodiment presents a hosted application where an enterprise requires secure 
communications with the server. The software instructions to enable the 
5 communication are stored on the server and accessed through the network by a 

client computer operator of the enterprise. In other embodiments, the software 
instructions may be stored and executed on the client computer. A user of the 
client computer with the help of a user interface can enter data required for the 
O execution of the software instructions. Data required for the execution of the 

ft 10 software instructions can also be accessed via the network and can be stored 

Id anywhere on the network. 

f|i Building the batch RS A algorithm into real-world systems presents a 

O number of architectural challenges. Batching, by its very nature, requires an 

y aggregation of requests. Unfortunately, commonly-deployed protocols and 

|TT 15 programs are not designed with RS A aggregation in mind. The solution in one 

embodiment is to create a batching server process that provides its clients with a 
decryption oracle, abstracting away the details of the batching procedure. 

With this approach modifications to the existing servers are minimized. 
Moreover, it is possible to simplify the architecture of the batch server itself by 
20 freeing it from the vagaries of the SSL protocol. An example of the resulting 

web server design is shown in Figure 3. Note that in batching the web server 
manages multiple certificates, i.e., multiple public keys, all sharing a common 
modulus N310. 
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One embodiment for managing multiple certificates is the two-tier 
model. For a protocol that calls for public-key decryption, the presence of a 
batch-decryption server 320 induces a two-tier model. First is the batch server 
process that aggregates and performs RSA decryptions. Next are client 
5 processes that send decryption requests to the batch server. These client 

processes implement the higher-level application protocol (e.g., SSL) and 
interact with end-user agents (e.g., browsers). 

Hiding the workings of the decryption server from its clients means that 
adding support for batch RSA decryption to existing servers engenders the same 

10 changes as adding support for hardware-accelerated decryption. The only 

additional challenge is in assigning the different public keys to the end-users 
such that there are roughly equal numbers of decryption requests with each e z -. 
As the end-user response times are highly unpredictable, there is a limit to the 
flexibility that may be employed in the public key distribution. 

15 If there are k keys each with a corresponding certificate, it is possible to 

create a web with ck web server processes with a particular key assigned to 
each. This approach provides that individual server processes need not be 
aware of the existence of multiple keys. The correct value for c depends on 
factors such as, but not limited to, the load on the site, the rate at which the 

20 batch server can perform decryption, and the latency of the communication with 
the clients. 

Another embodiment accommodates workload unpredictability. The 
batch server performs a set of related tasks including receiving requests for 
decryption, each of which is encrypted with a particular public exponent 
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Having received the requests it aggregates these into batches and performs the 
batch decryption as described herein. Finally, the server responds to the 
requests for decryption with the corresponding plain-text messages. The first 
and last of these tasks are relatively simple I/O problems and the decryption 
5 stage is discussed herein. What remains is the scheduling step. 

One embodiment possesses scheduling criteria including maximum 
throughput, minimum turnaround time, and minimum turnaround-time variance. 
The first two criteria are self-evident and the third is described herein. Lower 
turnaround-time variance means the server's behavior is more consistent and 
10 predictable which helps prevent client timeouts. It also tends to prevent 

starvation of requests, which is a danger under more exotic scheduling policies. 

Under these constraints a batch server's scheduling can implement a 
queue where older requests are handled first. At each step the server seeks the 
batch that allows it to service the oldest outstanding requests. It is impossible to 
1 5 compute a batch that includes more than one request encrypted with any 

particular public exponent e h This immediately leads to the central realization 
about batch scheduling that it makes no sense, in a batch, to service a request 
that is not the oldest for a particular e f . However, substituting the oldest request 
for a key into the batch improves the overall turnaround-time variance and 
20 makes the batch server better approximate a perfect queue. 

Therefore, in choosing a batch, this embodiment needs only consider the 
oldest pending request for each e ( . To facilitate this, the batch server keeps k 
queues Qu or one for each key. When a request arrives, it is placed onto the 
queue that corresponds to the key with which it was encrypted. This process 
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takes 0(1) time. In choosing a batch, the server examines only the heads of 
each of the queues. 

Suppose that there are k keys, with public exponents e\ 9 . . ., e& and that 
the server decrypts requests in batches of b messages each. The correct requests 
5 to batch are the b oldest requests from amongst the k queue heads. If the 

request queues Q t are kept in a heap with priority determined by the age of the 
request at the queue head, then batch selection can be accomplished by 
extracting the maximum, oldest-head, queue from the heap, de-queue the 
request at its head, and repeat the process to obtain b requests to batch. After 

10 the batch has been selected, the b queues from which requests were taken may 

be replaced in the heap. The entire process takes 0(blgk) time. 

Another embodiment utilizes multi-batch scheduling. While the process 
described above picks only a single batch, it is possible, in some cases, to 
choose several batches at once. For example, with b ~ 2, k = 3, and requests for 

15 the keys 3-3-5-7 in the queues, the one-step lookahead may choose to do a 5-7 

batch first, after which only the unbatchable 3-3 remain. A smarter server 
could choose to do 3-5 and 3-7 instead. The algorithms for doing lookahead 
are more complicated than the single-batch algorithms. Additionally, since they 
take into account factors other than request age, they can worsen turnaround- 

20 time variance or lead to request starvation. 

A more fundamental objection to multi-batch lookahead is that 
performing a batch decryption takes a significant amount of time. Accordingly, 
if the batch server is under load, additional requests will arrive by the time the 
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first chosen batch has been completed. These can make a better batch available 
than was without the new requests. 

But servers are not always under maximal load. Server design must take 
different load conditions into account. One embodiment reduces latency in a 
medium-load environment by using k public keys on the web server and 
allowing batching of any subset of b of them, for some b < k. To accomplish 
this the batches must be pre-constructed and the constants associated with^ ) 

batch trees must be keep in memory one for each set of e f s. 

However, it is no longer necessary to wait for exactly one request with 
each e before a batch is possible. For k keys batched b at a time, the expected 
number of requests required to give a batch is 



This equation assumes each incoming request uses one of the k keys 
randomly and independently. With 6 = 4, moving from k = 4 to k = 6 drops the 
expected length of the request queue at which a batch is available by more than 
31% from 8.33 to 5.70. 

The particular relationship of b and k can be tuned for a particular 
server. The batch-selection algorithm described herein is time-performance 
logarithmic in k, so the limiting factor on k is the size of the k^ prime, since 
particularly large values of e degrade the performance of batching. 

In low-load situations, requests trickle in slowly, and waiting for a batch 
to be available can introduce unacceptable latency. A batch server should have 
some way of falling back on unbatched RSA decryption, and, conversely, if a 
batch is available and batching is a better use of processor time than unbatched 
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RSA, the servers should be able to exploit these advantages. So, by the 
considerations given above, the batch server should perform only a single 
unbatched decryption, then look for new batching opportunities. 

Scheduling the unbatched decryptions introduces some complications. 
5 Previous techniques in the prior art provide algorithms that when requests 

arrive, a batch is accomplished if possible, otherwise a single unbatched 
decryption is done. This type of protocol leads to undesirable real-world 
behavior. The batch server tends to exhaust its queue quickly. Furthermore it 
responds immediately to each new request and never accumulates enough 

1 0 requests to batch. 

One embodiment chooses a different approach that does not exhibit the 
performance degradation associated with the prior art. The server waits for new 
requests to arrive, with a timeout. When new requests arrive, it adds them to its 
queues. If a batch is available, it evaluates it. The server falls back on 

15 unbatched RSA decryptions only when the request-wait times out. This 

approach increases the server's turnaround-time under light load, but scales 
gracefully in heavy use. The timeout value is tunable. 

Since modular exponentiation is asymptotically more expensive than the 
other operations involved in batching, the gain from batching approaches a 

20 factor-of-6 improvement only when the key size is improbably large. With 

1024-bit RSA keys the overhead is relatively high and a naive implementation 
is slower than unbatched RSA. The improvements described herein lower the 
overhead and improve performance with small batches and standard key-sizes. 
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Batching provides a sizeable improvement over plain RSA with 6 = 8 
and n = 2048. More important, even with standard 1024-bit keys, batching 
significantly improves performance. With 6 = 4, RSA decryption is accelerated 
by a factor of 2.6; with b = 8, by a factor of almost 3.5. These improvements 
5 can be leveraged to improve SSL handshake performance. 

At small key sizes, for example n - 512, an increase in batch size 
beyond b~4 provides only a modest improvement in RSA performance. 
Because of the increased latency that large batch sizes impose on SSL 
handshakes, especially when the web server is not under high load, large batch 
10 sizes are of limited utility for real-world deployment. 

SSL handshake performance improvements using batching can be 
demonstrated by writing a simple web server that responds to SSL handshake 
requests and simple HTTP requests. The server uses the batching architecture 
described herein. The web server is a pre-forked server, relying on "thundering 
15 herd" behavior for scheduling. All pre-forked server processes contact an 

additional batching server process for all RSA decryptions as described herein. 

Batching increases handshake throughput by a factor of 2.0 to 2.5, 
depending on the batch size. At better than 200 handshakes per second, the 
batching web server is competitive with hardware-accelerated SSL web servers, 
20 without the need for the expensive hardware. 

Figure 4 is a flow diagram for improving secure socket layer 
communication through batching of an embodiment. As in a typical initial 
handshake between server and client in establishing a secure connection, the 
client uses the servers public key to encrypt a random string R and then sends 
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the encrypted R to the server 420. The message is then cached 425 and the 
batching process begins by determining is there is sufficient encrypted messages 
coming into the server to form a batch 430. If the answer to that query is no, it 
is determined if the scheduling algorithm has timed out 440. Again if the 
5 answer is no the message returns to be held with other cached messages until a 

batch has been formed or the scheduler has timed out. If the scheduler has 
timed out 440 then the web server receives the encrypted message from the 
client containing R 442. The server then employs the private key of the public / 
private RSA key pair to decrypt the message and determine R 446. With R 

10 determined the client and the server use R to secure further communication 485 

and establish an encrypted session 490. 

Should enough encrypted messages be available to create a batch 430 
the method examines the possibility of scheduling multiple batches 450. With 
the scheduling complete the exponents of the private key are balanced, 455, and 

15 the e th root of the combined messages is extracted 458 allowing a common root 

to be determined and utilized 460. The embodiment continues by reducing the 
number of inversions by conducting delayed division 462 and batched division 
468. With the divisions completed, separate parallel batch trees are formed to 
determine the final inversions that are then combined 470. At this point 

20 simultaneous multiple exponents are applied to decrypt the messages 472 which 

are separated 476 and sent to the server in clear text 480. With the server and 
client both possessing the session key R 485 a encrypted session can be 
established 490. 
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Batching increases the efficiency and reduces the cost of decrypting the 
cipher-text message containing the session's common key. By combining the 
decryption of several messages in an optimized and time saving manner the 
server is capable of processing more messages thus increasing bandwidth and 
5 improving the over all effectiveness of the network. While the batching 

techniques described previously are a dramatic improvement in secure socket 
layer communication, other techniques can also be employed to improve the 
handshake protocol. 

Another embodiment for the improvement to the handshake protocol 

1 0 focuses on how the web server generates its RS A key and how it obtains a 

certificate for its public key. By altering how the browser uses the server's 
public key to encrypt a plain-text R, and how the web server uses its private key 
to decrypt the resulting cipher-text C, this embodiment provides significant 
improvements to SSL communications. 

15 In one embodiment a server generates an RS A public/private key pair by 

generating two distinct n-bit primes p and q and computing N =pq. While N 
can be of any arbitrary size, assume for simplicity that N is 1024 bits long and 
let w = gcd(/? - 1, # - 1) where gcd is the greatest common divisor. The server 
then picks two random k-bit values r h r 2 such that gcd(n, p - 1) = 1, gcd(r 2 , q - 

20 1) = 1, and r\ = r 2 mod w. Typically k falls in the range of 160 -512 bits in size. 

Although other larger values are also acceptable, k is minimized to enhance 
performance. The server then computes d such that d = n mod p—1 and d = r 2 
mod q ~ 1 . Having computed d, e* is found by solving the equation e* = cT 1 mod 
<p{N) resulting in the public key being (TV, e r ) and the private key (n, r 2 ). 
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The server then sends the public key to a Certificate Authority (CA). 
The CA returns a public key certificate for this public key even though e f is very 
large, namely on the order of N. This is unlike standard RSA public key 
certificates that use a small value of e, e.g. e = 65537. Consequently, the CA 
must be willing to generate certificates for such keys. 

To find d the Chinese Remainder Theorem is typically used. 
Unfortunately, p—l and q — 1 are not relatively prime (they are both even) and 
consequently the theorem does not apply. However, by letting w = gcd(p ~l,q 

- 1), knowing that — — - and - — - are relatively prime, and recalling that r\ = 
w w 

rz = a mod w, the CRT can be used to find an element d' such that 



w 
r 2 -a 



mod- 
mod— — 



w 



Observing that the required d is simply d=w • d' + a and indeed, 
d = r\ mod p- \ and d = r% mod q - 1, if w is large, the requirement that 
r\ = r-i mod w reduces the entropy of the private key. For this reason it is 
desirable to ensure that w is small and one embodiment lets w = 2, or namely 
that gcd(p - 1, q - 1) = 2. Recall that gcd(r u p~ 1) = 1 and gcd(r 2? q - 1) = 1. 
It follows that gcd(d,p ~ 1) = 1 and gcd(d, q ~ 1) = 1 and consequently 
gcd(rf, (/? - 1)(^ - 1)) = 1. Hence, d is invertible modulo <p(N) = (p- l)(q - 1). 

The web browser obtains the server's public key certificate from the 
server-hello message. In this embodiment, the certificate contains the server's 
public key (N, e). The web browser encrypts the pre-master-secret R using this 
public key in exactly the same way it encrypts using a normal RSA key. Hence, 
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there is no need to modify any of the browser's software. The only issue is that 
since e r is much larger than e in a normal RS A key, the browser must be willing 
to accept such public keys. 

When the web server receives the cipher-text C from the web browser 
5 the web server then uses the server's private key, (j\ 9 r 2 ), to decrypt C. To 

accomplish this the server computes R{ = C Vl mod^? and R' 2 - C r2 modq. 
Using CRT the server then computes an R e Zn such that r = R{ mo d p and 
R = R' 2 mod q 9 noting that R = Cf mod N. Hence, the resulting R is a proper 
decryption of C. 

10 Decryption using a standard RSA public key is completed with Cf mod 

N using the CRT. Typically R x = & modp ~ 1} mod p and R 2 = << d mod 9 ~~ 1} mod 
q is first computed and then the CRT is applied to R\ 9 R 2 to obtain R mod N. 
Note that the exponents d mod p—1 and mod # — 1 are typically as large as 
p and q 9 namely 512 bits each. Hence, to generate the signature the server must 

15 compute one exponentiation modulo p and one exponentiation modulo q. When 

iVis 1024 bits, the server does two full exponentiations modulo 512-bit 
numbers. 

In one embodiment, the server computes r{ 9 R' 2 and then applies CRT 
to R[ ; R' 2 . As in normal RSA, the bulk of the work is in computing R[ 5 R' 2 . 
20 However, computing R{ requires raising C to the power of r\ 9 which minimized. 

Since the time that modular exponentiation takes is linear in time to the size of 
the exponent, computing i?j takes approximately one third the work and one 
third of the time of raising C to the power of a 512 bit exponent. Hence, 
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computing j?j takes one third the work of computing R\. Therefore, during the 
entire decryption process the server does approximately one third the work as in 
a normal SSL handshake. 

To illustrate the implementation of this embodiment suppose Eve is an 
eavesdropper that listens on the network while the handshake protocol is taking 
place. Eve sees the server's public key {N, e f ) and the encrypted pre-master- 
secret C. Suppose r\ < r 2 . It can be shown that an adversary who has (N, e\ 0 
can mount an attack on the system that runs in time o{Jr\ log r x )* 




Let (N t e ') be an RS A public key with N = pq and let d e Z be the 
corresponding RSA private key satisfying d = r 1) mod p - 1 and d = r 2 mod q - 
1 with r } <r 2 . If ri is m bits long and it is assumed that rj ^ r 2 mod 2 m/2 , then 
given (N, e')z& adversary can expose the private key d in time o{jr\ log r x ) - 
One skilled in the art knows that e' - (n)" 1 mod (p - 1). But, suppose n is m- 
bits long. If n - A < 2 m/2 + B where A, B are in [0, 2 m/2 ] and a random g e Z N is 
selected combined with the definition 



i=0 

then if follows that G(g e B ) = Q mod p. This occurs since one of the products 
above is 



Since r x * r 2 mod 2 m/1 , it can be shown that G(g e ' B )*Q mod q. Hence, gcd (N, 
G(g e ' B )) gives a non-trivial factor of N. Hence, if G(x) mod N is evaluated at x 
= g e ' ' J for / = 0, ... , 2 m/2 at least one of the values will expose the factorization 






(mod jo)- 
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of AT. Evaluating a polynomial of degree 2 at 2 values can be done in time 
2 mI2 • /w/2 using Fast Fourier Transform methods. This algorithm requires 
Q^2*n/2 j S p ace> Hence, in time at most o{y[r\ log r x ) we can factor N. Thus in 

order to obtain security of 2 80 , both r\ and r 2 must be at least 160 bits long. 

Figure 5 is a flow diagram for improving secure socket layer 
communications of an embodiment by altering the public / private key pair. In 
operation, the server generates an RSA public / private key pair initiating a 
normal initial handshake protocol 510. At this point the server generates two 
distinct prime numbers 515 and takes the product of the numbers to produce the 
N component of the public key 520. Similarly, the server picks two random 
values to create the private key 525. Using the prime numbers 515 and the 
random values of the private key 525, the server computes the value d 530 and 
correspondingly the value e 535. The result is a new public / private key pair 
540 that the client uses to encrypt the pre-master-secret R 550. Once R has 
been encrypted with the new public key and transmitted to the server as cipher- 
text C, the server uses it private key to decrypt the pre-master-secret 560. Once 
Ri and R 2 have been determined 565 they are combined to find R 570. Having 
the value of the pre-master-secret intact, the server and client can establish a 
secure session 580. 

A further embodiment dealing with the handshake protocol reduces the 
work per connection on the web server by a factor of two. This embodiment 
works with all existing browsers. As before, the embodiment is illustrated by 
describing how the web server generates its RSA key and obtains a certificate 
for its public key. This embodiment continues in describing how the browser 
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uses the server's public key to encrypt a plain-text R, and the server uses its 
private key to decrypt the resulting cipher-text C. 

In this embodiment the server generates an RSA public/private key pair 
by generating two distinct n-bit primes p and q such that the size of each distinct 
prime number is on the order of one third of the size of N. Using this 
relationship the server computes N as N =p 2 -q. The relationship between the 
prime numbers and N is dependent on the power by which one of the prime 
number is raised. For example if one of the prime numbers was raised to the 
fourth power the prime numbers would be on the order of one fifth the size of 
N. The exponent of at least one of the prime numbers must be greater than one. 
While clearly N can be of arbitrary size, assume for simplicity that N* is 1024 
bits long, and hence p and q are 341 bits each. The server uses the same e used 
in standard RSA public keys, namely e = 65537 as long as gcd(>, (p-l)(q~ 
1)) = 1. The server then computes d = e~ l mod (p - \){q - 1) as well as r x = d 
mod p - 1 and r 2 = d mod q - 1 . With the public key being (N 7 , e) and the 
private key being {r h r 2 \ the server sends the public key, (TV, e), to a Certificate 
Authority (CA) and the CA returns a public key certificate. The public key in 
this case cannot be distinguished from a standard RSA public key. 

The web browser obtains the server's public key certificate from the 
server-hello message. The certificate contains the server's public key (N 7 , e). 
The web browser encrypts the pre-master-secret R using this public key in 
exactly the same way it encrypts using a normal RSA key. 

When the web server receives the cipher-text C from the web browser 
the web server decrypts Cby computing R[ = c r * mod p and R f 2 = c' 2 mod# - 
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Note that (r; ) e = Cmodp and (r' 2 ) e = C mod q. Lifting the server constructs 
an such that (r;') 6 - c modp 2 • More precisely, the server computes 

Using CRT, the server computes an R e Z N such that R = Rfmodp 2 and 
R = R' 2 mod q noting that R = C 1 mod N. Hence, the resulting R is a proper 
decryption of C. Recall that when TV is 1024 bits, the server does two full 
exponentiations modulo 512-bit numbers. 

In this embodiment the server computes R[ 9 R f 2 ? and then applies 
CRT to R^ R' 2 . The bulk of the work is in computing r[ 9 r' 2 ? r» but 
computing R[ requires a full exponentiation modulo a 341 -bit prime rather than 
a 5 12-bit prime. The same holds for R' 2 . Hence in this embodiment, 
computing R f x , r' 2 takes approximately half the time of computing R u R 2 . 
Furthermore, computing r» from R{ only requires a modular inversion modulo 
p 2 . This takes little time when compared with the exponentiations for 
computing R[ , r' 2 . Hence, using this embodiment the handshake takes 
approximately half the work of a normal handshake on the server. 

Some accelerator cards do not provide support for modular inversion. 
As a result, the inversion is preformed using an exponentiation. This is done by 
observing that for any ie Z* the inverse of x is given by: 

x -i =x p 2 -p-! ( modp 2). 

Unfortunately, using an exponentiation to do the inversion hurts performance. 
As discussed herein a better embodiment for inversion in this case is batching. 
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One can invert two numbers Xl , x 2 e Z* p as a batch faster than inverting the two 
numbers separately. To do so use the fact that 

x \ l = x 2 -fe^)" 1 and X 2 X =x i *( x i x 2) _1 (modp 2 )- 
Hence, at the cost of inverting x\ • X2 it is possible to invert both x\ and x 2 . This 
5 embodiment shows that an inversion of k elements x x, e Z* is at the cost 

of one inversion and k log 2 k multiplications. Thus, the amortized cost of a 
single inversion is \/k of an exponentiation plus log 2 £ multiplications. 

To take advantage of batched inversion in the SSL handshake a number 
of instances of the handshake protocol are collected from among different users 
10 and the inversion is preformed on all handshakes as a batch. As a result, the 

amortized total number of exponentiations per handshake is 2 + — . This 

k 

approximately gives a factor of two improvement in the handshake work on the 
server as compared to the normal handshake protocol 

The security of the improved handshake protocol depends on the 

15 difficulty of factoring integers of the form N=p 2 • q. When 1024 bit keys are 

used the fastest factoring algorithms (i.e. the number field sieve) cannot take 
advantage of the special structure of N. Similarly, p and q are well beyond the 
capabilities of the Elliptic Curve Method (ECM). 

Figure 6 is a flow diagram for modifying the public key of an 

20 embodiment to facilitate an improvement in secure socket layer communication. 

As in other embodiments, the process begins with the servers generation of a 
RS A public / private key pair 610. In this embodiment, the public key is 
modified. The web server generates two distinct prime numbers 612 and 
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computes a new N' 618. Using the same exponent 620 the server computes the 
value d 622 which it uses to find the private key 628. The result is a pubic / 
private key combination 630 that the sever then sends to the client for the 
encryption of the pre-master-secret 640. Once the server receives the encrypted 
pre-master-secret, R, from the client 650 the server decrypts R 660 by 
computing Rl 662 and R2 668 and combining the results 670. Once R has been 
determined the client can establish a secure session with the client using the 
new session key 680. 

From the above description and drawings, it will be understood by those 
of ordinary skill in the art that the particular embodiments shown and described 
are for purposes of illustration only and are not intended to limit the scope of 
the claimed invention. 
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